Skip to content

BUG fixing module first() and DateOffset #52487

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 17 commits into from

Conversation

DeaMariaLeon
Copy link
Member

@DeaMariaLeon DeaMariaLeon commented Apr 6, 2023

Also fixes part of #51284 (I have not worked on last())

Edge cases were only fixed for offsets passed as string.
Trying to make it work for DateOffset would make it more complicated. Should this be fixed or should DateOffset be removed from the function (and docs)?
Will DataOffset be deprecated or fixed? it has bugs or the documentation is incorrect.

As commented by @MarcoGorelli:

  • Functions first() and last() assume that the Series or DataFrame have a sorted DatetimeIndex. Only last() has it specified in the documentation.

  • first() could be easily done as follows (in case it's deprecated):
    df[df.index < df.index[0] + DateOffset(days= 15)]

I also tried to use first() with relativedelta but I couldn’t make it work.

Edit:

first() doesn’t work for DateOffset. With this PR DateOffset works, but it now seems that there are inconsistencies when given a string or MonthEnd.
I think that since the PRs to fix #29623, first() with a string doesn’t work as normal offset arithmetic.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

I'm not super-familiar with offsets but I left a couple of comments, will get back to this

offset = to_offset(offset)
if not isinstance(offset, Tick) and offset.is_on_offset(self.index[0]):
if type(offset) is offsets.DateOffset and offset.is_on_offset(self.index[0]):
end_date = self.index[0] - Timedelta(1, unit="d") + offset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can hard-code Timedelta(1, unit='d'), e.g. it would break DateOffset(minutes=2):

(Pdb) p self.index[0]
Timestamp('2018-04-09 00:00:00')
(Pdb) p self.index[0] - Timedelta(1, unit='d')
Timestamp('2018-04-08 00:00:00')
(Pdb) p self.index[0] - Timedelta(1, unit='d') + offset
Timestamp('2018-04-08 00:02:00')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review!
Yes, we should not hard-code like that, I did it as a draft. I wanted to show how different the behavior of DateOffset is from the "Tick" object.
The whole first() was not implemented for DateOffset, there were no tests either.
Then there are bugs on it (on DateOffset itself).
In the end, I wanted to know if patching first() was correct, if DateOffset will be deprecated or if it should be fixed.

Some of the issues of DateOffset:

  1. DateOffset.is_on_offset is different than the offset done with a string.
  2. DateOffset.name doesn't work
  3. At some point adding an offset to it, was going backwards! (as in self.index[0] + offset = earlier date)

@MarcoGorelli
Copy link
Member

Functions first() and last() assume that the Series or DataFrame have a sorted DatetimeIndex. Only last() has it specified in the documentation.

do you want to start by making a docs PR to fix this? I think this would need doing regardless of whether DateOffset were supported

@DeaMariaLeon
Copy link
Member Author

do you want to start by making a docs PR to fix this?

Yes

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this further, would it work to just do

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index e16ef3580d..274ab41260 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -43,6 +43,7 @@ from pandas._libs.tslibs import (
     Timestamp,
     to_offset,
 )
+from pandas._libs.tslibs.offsets import DateOffset
 from pandas._typing import (
     AlignJoin,
     AnyArrayLike,
@@ -9073,6 +9074,13 @@ class NDFrame(PandasObject, indexing.IndexingMixin):
         if len(self.index) == 0:
             return self.copy(deep=False)
 
+        if isinstance(offset, DateOffset):
+            end_date = end = self.index[0] + offset
+            if end_date in self.index:
+                end = self.index.searchsorted(end_date, side="left")
+                return self.iloc[:end]
+            return self.loc[:end]
+
         offset = to_offset(offset)
         if not isinstance(offset, Tick) and offset.is_on_offset(self.index[0]):
             # GH#29623 if first value is end of period, remove offset with n = 1

? (I'm showing the diff of generic.py compared with upstream/main)

@DeaMariaLeon
Copy link
Member Author

would it work to just do

No... look:

>>> x = pd.Series([1] * 10, index=pd.bdate_range("2010-03-31", periods=10))
>>> result = x.first(pd.DateOffset(months=1))
>>> result
2010-03-31    1
2010-04-01    1
2010-04-02    1
2010-04-05    1
2010-04-06    1
2010-04-07    1
2010-04-08    1
2010-04-09    1
2010-04-12    1
2010-04-13    1
Freq: B, dtype: int64
>>> result2 = x.first("1M")
>>> result2
2010-03-31    1
Freq: B, dtype: int64

It doesn't need to start with an "end of the month" to fail.
The reason is that if offset is a DateOffset:

offset.is_on_offset(self.index[0])
True

If offset is string:

offset.is_on_offset(self.index[0])
False

We need to substract the Timedelta(1, unit= ___) - we could take the unit from the DateOffset.kwds or transform DateOffset to string, to start with. But I'm not sure if that's a good solution.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Apr 11, 2023

Isn't that fine though?

Because DateOffset(months=1) is different from MonthEnd(1) (and 'M', somewhat confusingly, means 'month end' - but there is a PR open to deprecate and rename that)

In [4]: pd.Timestamp.now() + pd.DateOffset(months=1)
Out[4]: Timestamp('2023-05-11 16:26:01.139005')

In [5]: pd.Timestamp.now() + pd.tseries.offsets.MonthEnd(1)
Out[5]: Timestamp('2023-04-30 16:26:14.195128')

I think it makes sense that

x.first(pd.DateOffset(months=1))

would take the rows within one calendar month of the first element, whereas x.first('1M') would only go until the first month-end

@DeaMariaLeon
Copy link
Member Author

Oh, OK!

but there is a PR open to deprecate and rename that

Does that mean that both (string offset and DateOffset) will return the same answer eventually?

Should I add examples with DateOffset in the docs, and new tests maybe?

last() will need to be fixed too

I don't need to speak @ the meeting, yaaay!

@MarcoGorelli
Copy link
Member

Does that mean that both (string offset and DateOffset) will return the same answer eventually?

Sorry I was unclear, what I meant was that 'M' will be renamed to 'ME' to make it clear that it means "month end" (rather than calendar month, like DateOffset(months=1))

Should I add examples with DateOffset in the docs, and new tests maybe?

Sounds good!

@mroeschke mroeschke added the Frequency DateOffsets label Apr 12, 2023
@DeaMariaLeon
Copy link
Member Author

Sorry I was unclear

No reason to apologise. Thanks for explaining.

I will add examples in the docs on another PR if that's fine.

@DeaMariaLeon DeaMariaLeon marked this pull request as ready for review April 13, 2023 11:51
@DeaMariaLeon
Copy link
Member Author

Would somebody very kind review this "annoying" PR @MarcoGorelli ? :)

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one, thanks! We can do a follow-up for last as well after

@MarcoGorelli MarcoGorelli added this to the 2.1 milestone Apr 14, 2023
@DeaMariaLeon
Copy link
Member Author

DeaMariaLeon commented Apr 17, 2023

Nice one, thanks! We can do a follow-up for last as well after

My pleasure... Yes, we can do last() as well.

It's unclear to me why you only commented about using date_range on one test, but not the others. I mean, I use it on the other tests as well.
And when I removed date_range from line 122 (def test_first_with_date_offset) the assert_equal failed because freq=None on expected. So I removed it also from the input i. I’m not sure if that’s OK.

Before I removed freqat the input, I tried adding it to expected, but the methods I tried used an offset as well.

@MarcoGorelli : it looks green

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Just left a minor comment on the whatsnew note

I'll leave open a bit in case others (@jbrockmendel ? @phofl ?) have comments, else can merge


It's unclear to me why you only commented about using date_range on one test, but not the others. I mean, I use it on the other tests as well.

Just to not repeat myself :)

@@ -33,6 +33,7 @@ Bug fixes
- Bug in :meth:`ArrowDtype.__from_arrow__` not respecting if dtype is explicitly given (:issue:`52533`)
- Bug in :meth:`DataFrame.max` and related casting different :class:`Timestamp` resolutions always to nanoseconds (:issue:`52524`)
- Bug in :meth:`Series.describe` not returning :class:`ArrowDtype` with ``pyarrow.float64`` type with numeric data (:issue:`52427`)
- Fixed bug in :func:`first` when used with a DateOffset (:issue:`45908`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this close both issues? if so:

Suggested change
- Fixed bug in :func:`first` when used with a DateOffset (:issue:`45908`)
- Fixed bug in :func:`first` when used with a :class:`DateOffset` (:issue:`45908`)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had done this, but changed again later (per later request)

@DeaMariaLeon
Copy link
Member Author

Thanks @MarcoGorelli for the review (and the patience).

@@ -33,6 +33,7 @@ Bug fixes
- Bug in :meth:`ArrowDtype.__from_arrow__` not respecting if dtype is explicitly given (:issue:`52533`)
- Bug in :meth:`DataFrame.max` and related casting different :class:`Timestamp` resolutions always to nanoseconds (:issue:`52524`)
- Bug in :meth:`Series.describe` not returning :class:`ArrowDtype` with ``pyarrow.float64`` type with numeric data (:issue:`52427`)
- Fixed bug in :func:`first` when used with a :class:`DateOffset` (:issue:`45908`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no pd.first. Maybe :meth:DataFrame.first?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change done

@@ -9070,7 +9071,15 @@ def first(self, offset) -> Self:
if len(self.index) == 0:
return self.copy(deep=False)

if isinstance(offset, offsets.DateOffset):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't the offset = to_offset(offset) below make this check unnecessary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue is that doing this check after to_offset would make it impossible to distinguish between MonthEnd, as that would also satisfy isinstance(offset, DateOffset)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they need to be handled differently because something like DateOffset(days=15) would always return True for if not isinstance(offset, Tick) and offset.is_on_offset(self.index[0]), and so would be shifted backward by offset.base (this is why in the linked issues, the result only has a single row for DateOffset input)

if end_date in self.index:
end = self.index.searchsorted(end_date, side="left")
return self.iloc[:end]
return self.loc[:end]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a lot of this looks really similar to what we have below. can any of this be de-duplicated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@DeaMariaLeon
Copy link
Member Author

Friendly ping @MarcoGorelli @jbrockmendel

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait sorry, I gave a bad review here - this introduces an inconsistency between first('M') and first(MonthEnd()) (which should mean the same thing - reminder that 'M' means 'month end'):

In [1]: ser = pd.Series([1,2,3], index=pd.DatetimeIndex(['2020-01-31', '2020-02-01', '2020-03-01']))

In [2]: ser.first(MonthEnd())
Out[2]: 
2020-01-31    1
2020-02-01    2
dtype: int64

In [3]: ser.first('M')
Out[3]: 
2020-01-31    1
dtype: int64

@jbrockmendel is there a way to tell if an offset is DateOffset but not MonthEnd? I was a bit surprised to see

In [11]: isinstance(MonthEnd(), DateOffset)
Out[11]: True

. type(offset) is DateOffset would technically work, but I don't see it being done anywhere else

(as an aside, the fact that the tests still pass indicates the relatively poor test coverage that this function had to begin with)

# GH#29623 if first value is end of period, remove offset with n = 1
# before adding the real offset
end_date = end = self.index[0] - offset.base + offset
else:
end_date = end = self.index[0] + offset

# Tick-like, e.g. 3 weeks
if isinstance(offset, Tick) and end_date in self.index:
if isinstance(offset, Tick) or input_is_offset and end_date in self.index:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can never remember if a or b and c is evaluated as (a or b) and c or a or (b and c), could we add parantheses please?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change that.

@DeaMariaLeon
Copy link
Member Author

this introduces an inconsistency between first('M') and first(MonthEnd()) (which should mean the same thing - reminder that 'M' means 'month end'):

I think the original issue #29623 was not a bug but first() (with a string) was modified to match the docstring: “For instance, ‘1M’ will display all the rows having their index within the first month.”

If first accepts a DateOffset then it should always handle this offset in a way which is consistent with DateOffset arithmetic. If the first timestamp in the index is '2020-01-31' then a MonthEnd() offset means going to the following month end:

from pandas.tseries.offsets import MonthEnd, Day
from pandas.tseries.frequencies import to_offset
from pandas.tseries.offsets import DateOffset

t = pd.Timestamp('2020-01-31')
t + MonthEnd()
Timestamp('2020-02-29 00:00:00’)

Adding for example 15 days means going 15 days forward

t + Day(15)
Timestamp('2020-02-15 00:00:00’)

Here we go to mid February with a 15-day offset. In this example, the timespan covered by an offset of month end is thus longer than that of an offset of 15 days. It would be completely unexpected for the user if the time period returned by first would be the opposite (ie shorter with a MonthEnd offset than with a 15-day offset).

For example:

i = pd.date_range('2020/1/31', periods=31, freq="D")
s = pd.Series(range(len(i)), index=i)

print(s.first(MonthEnd()).tail(5))
2020-02-24    24
2020-02-25    25
2020-02-26    26
2020-02-27    27
2020-02-28    28
Freq: D, dtype: int64

print(s.first(DateOffset(months=1)).tail(5))
2020-02-24    24
2020-02-25    25
2020-02-26    26
2020-02-27    27
2020-02-28    28
Freq: D, dtype: int64

print(s.first(‘1M').tail(5))
2020-01-31    0
Freq: D, dtype: int64

print(s.first(‘15D').tail(5))
2020-02-10    10
2020-02-11    11
2020-02-12    12
2020-02-13    13
2020-02-14    14
Freq: D, dtype: int64

So I think the options should be the following :

  • have first() behave with a DateOffset input with normal offset arithmetic and behave with a string input as it does now (what we have on this PR). But explain the difference in the docstring.
  • allow only string inputs to avoid any confusion and any change in the current string input behaviour.
  • change the behaviour with a string input so that it behaves consistently with month end semantics. The docstring could be changed to that effect to :'1M' will display all the rows having their index within the first month end following the first element in the index

I'm not sure if I have to (friendly) ping you :) @MarcoGorelli

@MarcoGorelli
Copy link
Member

I think the original issue #29623 was not a bug but first() (with a string) was modified to match the docstring:

I'm inclined to agree, and am pretty tempted to suggest reverting the PR which changed it

I'll add it to Wednesday's agenda

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 27, 2023
@DeaMariaLeon
Copy link
Member Author

I will work on this shortly

@DeaMariaLeon
Copy link
Member Author

Closing this one because Series/DataFrame.first() has been deprecated: #53419

@DeaMariaLeon DeaMariaLeon deleted the newbug branch June 28, 2023 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants